Using Michael Jordan’s game statistics to estimate points scored per game
The project objective
The main objective of the analysis is to examine how other game statistics relate to Michael Jordan’s scoring across 9 NBA regular seasons and to build a predictive model that explains the number of points scored per game.
Dataset
To create the dataset, game statistics from each of the 9 NBA regular seasons were collected and merged into a single main data frame. The statistics were sourced from www.basketball-reference.com, a website that houses statistics, results, and histories of the NBA, ABA, and WNBA leagues, as well as top European competitions.
| Date | AH | Opp | WL | MP | PTS | FG | P2A | FGA | P3 | P3A | FT | FTA | ORB | DRB | TRB | AST | STL | BLK | TOV | PF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1984-10-26 | H | WSB | W | 40 | 16 | 5 | 16 | 16 | 0 | 0 | 6 | 7 | 1 | 5 | 6 | 7 | 2 | 4 | 5 | 2 |
| 1984-10-27 | A | MIL | L | 34 | 21 | 8 | 13 | 13 | 0 | 0 | 5 | 5 | 3 | 2 | 5 | 5 | 2 | 1 | 3 | 4 |
| 1984-10-29 | H | MIL | W | 34 | 37 | 13 | 24 | 24 | 0 | 0 | 11 | 13 | 2 | 2 | 4 | 5 | 6 | 2 | 3 | 4 |
| 1984-10-30 | A | KCK | W | 36 | 25 | 8 | 21 | 21 | 0 | 0 | 9 | 9 | 2 | 2 | 4 | 5 | 3 | 1 | 6 | 5 |
| 1984-11-01 | A | DEN | L | 33 | 17 | 7 | 15 | 15 | 0 | 0 | 3 | 4 | 3 | 2 | 5 | 5 | 1 | 1 | 2 | 4 |
| 1984-11-07 | A | DET | W | 27 | 25 | 9 | 19 | 19 | 0 | 0 | 7 | 9 | 1 | 3 | 4 | 3 | 3 | 1 | 5 | 5 |
| 1984-11-08 | A | NYK | W | 33 | 33 | 15 | 22 | 22 | 0 | 0 | 3 | 4 | 4 | 4 | 8 | 5 | 3 | 2 | 5 | 2 |
| 1984-11-10 | A | IND | W | 42 | 27 | 9 | 22 | 22 | 0 | 0 | 9 | 12 | 2 | 7 | 9 | 4 | 2 | 5 | 3 | 4 |
| 1984-11-13 | H | SAS | W | 43 | 45 | 18 | 26 | 27 | 1 | 1 | 8 | 11 | 2 | 8 | 10 | 4 | 3 | 2 | 4 | 4 |
| 1984-11-15 | H | BOS | L | 33 | 27 | 12 | 23 | 24 | 0 | 1 | 3 | 3 | 0 | 2 | 2 | 2 | 2 | 1 | 1 | 4 |
The table shows a sample of the data, which cover the regular-season games played by Jordan from his rookie season (1984-85) through the 1992-93 season, at the end of which the team won the NBA championship for the third consecutive time (Three-peat).
Description of the headers:
- Date - The date of the game (YYYY-MM-DD)
- AH - Where the game was played
  - A - Away
  - H - Home
- Opp - The opponent the Chicago Bulls played on that date
- WL - Game result
  - W - Win
  - L - Loss
- MP - Minutes played
- PTS - Points scored
- FG - Field goals made
- P2A - Two-point attempts
- FGA - Field goal attempts
- P3 - Three-pointers made
- P3A - Three-point attempts
- FT - Free throws made
- FTA - Free throw attempts
- ORB - Offensive rebounds
- DRB - Defensive rebounds
- TRB - Total rebounds
- AST - Assists
- STL - Steals
- BLK - Blocks
- TOV - Turnovers
- PF - Personal fouls
The presented data frame has the following structure:
| Column | Data type |
|---|---|
| Date | Date |
| AH | character |
| Opp | character |
| WL | character |
| MP | numeric |
| PTS | numeric |
| FG | numeric |
| P2A | numeric |
| FGA | numeric |
| P3 | numeric |
| P3A | numeric |
| FT | numeric |
| FTA | numeric |
| ORB | numeric |
| DRB | numeric |
| TRB | numeric |
| AST | numeric |
| STL | numeric |
| BLK | numeric |
| TOV | numeric |
| PF | numeric |
The dataset consists of 667 observations and 21 columns: 3 columns of character type, 1 column of date type, and 17 columns of numeric type. The data do not contain any missing (NA) values.
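The column definitions imply a few arithmetic identities that can serve as a quick consistency check on the merged data frame. The minimal Python sketch below (the report's own analysis was run in R) checks three of them on rows copied from the sample table. The identities are assumptions based on standard box-score definitions, not checks described in the report, and the helper name `is_consistent` is hypothetical.

```python
# Three rows copied from the sample table (1984-10-26, 1984-11-13, 1984-11-15).
rows = [
    {"PTS": 16, "FG": 5, "P2A": 16, "FGA": 16, "P3": 0, "P3A": 0,
     "FT": 6, "ORB": 1, "DRB": 5, "TRB": 6},
    {"PTS": 45, "FG": 18, "P2A": 26, "FGA": 27, "P3": 1, "P3A": 1,
     "FT": 8, "ORB": 2, "DRB": 8, "TRB": 10},
    {"PTS": 27, "FG": 12, "P2A": 23, "FGA": 24, "P3": 0, "P3A": 1,
     "FT": 3, "ORB": 0, "DRB": 2, "TRB": 2},
]


def is_consistent(row):
    """Return True if a row satisfies the assumed box-score identities."""
    # Every made field goal is worth 2 points, a three-pointer adds 1 more,
    # and every made free throw is worth 1 point.
    points_ok = row["PTS"] == 2 * row["FG"] + row["P3"] + row["FT"]
    # Two-point attempts are field goal attempts minus three-point attempts.
    attempts_ok = row["P2A"] == row["FGA"] - row["P3A"]
    # Total rebounds are offensive plus defensive rebounds.
    rebounds_ok = row["TRB"] == row["ORB"] + row["DRB"]
    return points_ok and attempts_ok and rebounds_ok


for row in rows:
    print(row["PTS"], "points:", "consistent" if is_consistent(row) else "inconsistent")
```

Because P2A is a linear combination of FGA and P3A, using all three columns as predictors in one regression would make them perfectly collinear, which is one reason to keep only two of them.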
Basic statistics
| Characteristic (N = 667) | Mean | Minimum | Maximum | SD |
|---|---|---|---|---|
| PTS | 32 | 8 | 69 | 9 |
| FG | 12.1 | 3.0 | 27.0 | 3.7 |
| P2A | 22.0 | 7.0 | 44.0 | 5.8 |
| FGA | 23 | 7 | 49 | 6 |
| P3 | 0.43 | 0.00 | 7.00 | 0.83 |
| P3A | 1.43 | 0.00 | 12.00 | 1.58 |
| FT | 7.6 | 0.0 | 26.0 | 4.0 |
| FTA | 9.0 | 0.0 | 27.0 | 4.5 |
| ORB | 1.70 | 0.00 | 8.00 | 1.48 |
| DRB | 4.63 | 0.00 | 13.00 | 2.59 |
| TRB | 6.3 | 0.0 | 18.0 | 3.1 |
| AST | 5.90 | 1.00 | 17.00 | 2.79 |
| STL | 2.72 | 0.00 | 10.00 | 1.71 |
| BLK | 1.03 | 0.00 | 6.00 | 1.11 |
| TOV | 3.01 | 0.00 | 8.00 | 1.75 |
| PF | 2.91 | 0.00 | 6.00 | 1.37 |
The summary table on its own did not yield any initial insights that could help in model construction.
Charts
| Minimum | Mean | Median | Sum | Maximum |
|---|---|---|---|---|
| 8 | 32.3 | 32 | 21541 | 69 |
The variable PTS (points scored) is a key factor in the model-building process and serves as the dependent variable of the constructed model. The range of values for this variable spans from a minimum of 8 to a maximum of 69, indicating the diversity of achieved results. The median, which establishes the central point of the distribution, is 32, with a close mean of 32.3.
The skewness of the points variable, which measures the asymmetry of the distribution, is 0.346. The positive value indicates that the right tail of the distribution is longer than the left, so unusually high scoring games are more common than unusually low ones. This right skew may result from exceptional games that inflate the mean, for example the 10 games in which Jordan scored 54 or more points.
The kurtosis of the points variable, which measures the heaviness of the tails of the distribution (often described as peakedness), is 0.386. Interpreted as excess kurtosis, for which a normal distribution has a value of 0, this moderately positive value suggests that the distribution is slightly more peaked and heavier-tailed than a normal distribution.
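To make the two shape measures concrete, here is a minimal Python sketch of moment-based skewness and excess kurtosis (the report's own figures came from R). The sample values are hypothetical, and because statistical packages apply different small-sample corrections, this sketch is not expected to reproduce 0.346 and 0.386 exactly even on the real data.

```python
from statistics import fmean


def central_moment(values, order):
    """Return the central moment of the given order (dividing by n)."""
    mean = fmean(values)
    return fmean([(v - mean) ** order for v in values])


def skewness(values):
    """Moment-based skewness: third central moment over variance^(3/2)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Hypothetical point totals, for illustration only (not the real data set).
sample_points = [8, 25, 28, 30, 31, 32, 32, 33, 35, 37, 40, 45, 54, 69]
print("skewness:", round(skewness(sample_points), 3))
print("excess kurtosis:", round(excess_kurtosis(sample_points), 3))
```

A positive skewness means the right tail is longer, and a positive excess kurtosis means heavier tails than the normal distribution.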
The plot of points scored against minutes played shows a clear positive correlation: the longer Jordan stays on the court, the more points he tends to score. This suggests that playing time is one of the key factors associated with his point total.
In the plot of points scored against assists, the trend line has a characteristic shape resembling a flattened lowercase ‘m’ and stays within a narrow band of 31 to 33 points.
Across most of the observed range of assists, the trend line therefore barely moves, meaning that points scored stay at roughly the same level regardless of the number of assists. This suggests that, in this analysis, the number of assists is not a key factor influencing scoring.
The plot of points scored by game location shows that the median (0.5 quantile) and the 0.75 quantile are lower for away games than for home games. In other words, both the typical game and the upper quartile of games produce fewer points on the road. The 0.25 quantiles are equal, suggesting that the lower quartile of scoring does not differ between the two locations.
| Type | Q0.25 | Median | Q0.75 |
|---|---|---|---|
| Away | 26 | 32 | 37 |
| Home | 26 | 33 | 38 |
The calculations show that the median (0.5 quantile) and the 0.75 quantile of points scored are each one point higher for home games, indicating slightly higher scoring at home, while the 0.25 quantile is 26 points in both locations.
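Quartiles like those in the table above can be computed per location with a few lines of code. The Python sketch below uses a hypothetical game log rather than the real data. The `inclusive` method interpolates linearly between order statistics, which should correspond to the default behaviour of R's `quantile()`, although that equivalence is stated here as an assumption.

```python
from statistics import quantiles


def quartiles(points):
    """Return (Q0.25, median, Q0.75) using linear interpolation."""
    q1, median, q3 = quantiles(points, n=4, method="inclusive")
    return q1, median, q3


# Hypothetical (location, points) pairs, for illustration only.
game_log = [
    ("H", 33), ("A", 21), ("H", 37), ("A", 25), ("A", 17),
    ("H", 45), ("A", 33), ("H", 27), ("A", 27), ("H", 38),
]

# Group the point totals by game location.
points_by_location = {}
for location, points in game_log:
    points_by_location.setdefault(location, []).append(points)

for location, points in sorted(points_by_location.items()):
    print(location, quartiles(points))
```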
The three plots of points scored against attempts in each shot category (free throws, two-point shots, which this report also calls mid-range shots, and three-point shots) show clear positive correlations between the number of attempts and points scored. All three plots show a strong relationship, suggesting that shot volume significantly affects Jordan’s scoring.
We conclude that all three shot categories likely have a strong effect on the final point total and should therefore be highly statistically significant variables in the predictive model.
When the number of attempts differs substantially between shot types, this can affect the model, especially because shot types differ in how many points they are worth.
In this case, two-point attempts are far more numerous than the other categories, so the model may estimate their effect more precisely. However, the mere fact that one category is more numerous does not automatically mean that it will have a greater impact in the model.
The plot of points scored against offensive rebounds shows some interesting trends. As the number of offensive rebounds increases, points rise slightly, suggesting a weak positive correlation between the two variables.
At 1 rebound the trend line rises slightly, hinting that even a single offensive rebound is associated with more points. At 2 rebounds it reaches a bump at around 33 points and then falls slightly to 32.
From 4 rebounds onwards the trend line rises clearly, but the confidence interval around it widens considerably. This suggests that at higher numbers of offensive rebounds the variable becomes a less certain predictor of points scored.
In the plot of points scored against turnovers, the trend line fluctuates around 32.5 points (between 32 and 33). Points scored decrease slightly beyond 5 turnovers, and from 7 turnovers onwards the confidence interval of the trend line widens sharply. This suggests that turnovers have a limited effect on points scored and that the variable may turn out to be statistically insignificant in the predictive model.
In the plot of points scored against steals, the trend line initially rises at almost a 45-degree angle, indicating a positive relationship between the variables. An interesting feature is that the line flattens at 3 steals: the first few steals are associated with more points, but additional steals then matter less, and for 4 to 7 steals the number of points scored is almost constant. Beyond this range, more steals are again associated with more points, giving the curve a convex shape. This suggests that the number of steals may be a significant explanatory factor in the model, but that its effect is nonlinear.
The correlation table of variables
A preliminary analysis of the correlation matrix shows that 6 variables are significantly correlated with PTS. There are clear positive correlations for P2A, two-point attempts (0.71), and FTA, free throw attempts (0.57), indicating a strong relationship between these attempt counts and points scored. A moderate positive correlation for MP, minutes played (0.44), also suggests that more time on the court tends to go with more points.
The remaining 3 variables (steals, offensive rebounds, and three-point attempts) have low positive correlations (0.15-0.20) with the dependent variable, indicating that their influence in the constructed model will be smaller.
For three other variables, the correlation with the number of points scored is not statistically significant.
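With 667 games, even a low correlation can be statistically significant. The Python sketch below illustrates this with the usual t-statistic for a Pearson correlation. It assumes this is the kind of test behind the correlation table, which the report does not state, and it approximates the two-sided 5% critical value by 1.96.

```python
from math import sqrt

N_GAMES = 667      # number of observations reported for the data set
T_CRITICAL = 1.96  # assumption: normal approximation, two-sided 5% level


def correlation_t_statistic(r, n):
    """t-statistic for testing that a Pearson correlation is zero."""
    return r * sqrt((n - 2) / (1 - r * r))


def smallest_significant_r(n, t_critical):
    """Smallest |r| that reaches the critical value for a sample of size n."""
    return t_critical / sqrt(n - 2 + t_critical ** 2)


# Correlations with PTS quoted in the text; 0.15 is the low end of 0.15-0.20.
for label, r in [("P2A", 0.71), ("FTA", 0.57), ("MP", 0.44), ("low end", 0.15)]:
    t = correlation_t_statistic(r, N_GAMES)
    print(f"{label}: r = {r:.2f}, t = {t:.2f}, significant = {abs(t) > T_CRITICAL}")

print("threshold |r|:", round(smallest_significant_r(N_GAMES, T_CRITICAL), 3))
```

Under these assumptions, any correlation larger than roughly 0.08 in absolute value is significant at this sample size, so significance alone says little about how much a variable will contribute to the model.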
Model construction
Using stepwise and backward regression, we obtained a single model that explains the variable PTS using the variables AH, MP, FTA, P2A, P3A, and ORB. Because Michael Jordan was known for recording a large number of steals, we also built a model that adds the variable STL. After further analyses, which are not included in this report, we decided not to add STL to the model because it was statistically insignificant, and to remove ORB because its interpretation was unclear. The three models were compared using measures of model quality and selection criteria; the summary of the final model is shown below.
Call:
lm(formula = PTS ~ AH + MP + FTA + P2A + P3A, data = RS)
Residuals:
| Min | 1Q | Median | 3Q | Max |
|---|---|---|---|---|
| -15.5509 | -3.1272 | 0.0035 | 3.2000 | 19.7824 |
Coefficients:
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) | Signif. |
|---|---|---|---|---|---|
| (Intercept) | 3.20300 | 1.40985 | 2.272 | 0.023415 | * |
| AHH | 1.33433 | 0.38218 | 3.491 | 0.000513 | *** |
| MP | -0.09970 | 0.04338 | -2.298 | 0.021847 | * |
| FTA | 0.88403 | 0.04507 | 19.615 | < 0.0000000000000002 | *** |
| P2A | 1.01380 | 0.03958 | 25.611 | < 0.0000000000000002 | *** |
| P3A | 1.37890 | 0.12239 | 11.266 | < 0.0000000000000002 | *** |
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.9 on 661 degrees of freedom
Multiple R-squared: 0.7172, Adjusted R-squared: 0.7151
F-statistic: 335.3 on 5 and 661 DF, p-value: < 0.00000000000000022
All estimated coefficients are statistically significant at the 5% level. The coefficient of determination is 0.7172, which means that approximately 72% of the variability in PTS is explained by the independent variables. The residual standard error is 4.9, so the observed point totals typically deviate from the fitted values by about 4.9 points per game.
PTS = 3.20 + 1.33 * AHH - 0.10 * MP + 0.88 * FTA + 1.01 * P2A + 1.38 * P3A, where AHH equals 1 for home games and 0 for away games.
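The fit statistics in the model summary can be cross-checked against each other. The Python sketch below uses the standard formulas for adjusted R-squared and the overall F-statistic and takes the reported R-squared, sample size, and number of predictors as inputs. Because R-squared is rounded to four decimals, small differences in the last digit are expected.

```python
R_SQUARED = 0.7172  # Multiple R-squared from the model summary
N_OBS = 667         # number of games
N_PREDICTORS = 5    # AHH, MP, FTA, P2A, P3A

# Residual degrees of freedom: observations minus predictors minus intercept.
residual_df = N_OBS - N_PREDICTORS - 1

# Adjusted R-squared penalises R-squared for the number of predictors.
adjusted_r_squared = 1 - (1 - R_SQUARED) * (N_OBS - 1) / residual_df

# F-statistic: explained variance per predictor over residual variance per df.
f_statistic = (R_SQUARED / N_PREDICTORS) / ((1 - R_SQUARED) / residual_df)

print("residual df:", residual_df)
print("adjusted R-squared:", round(adjusted_r_squared, 4))
print("F-statistic:", round(f_statistic, 1))
```

The results agree with the reported 661 degrees of freedom, adjusted R-squared of 0.7151, and F-statistic of 335.3 up to rounding.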
Model diagnostics
Normal Q-Q plot
In the considered model, the points do not deviate noticeably from the straight line, so we can assume that the residuals are normally distributed.
Residuals vs Fitted plot
On the Residuals vs Fitted plot, the smoothed line is approximately straight, indicating that the linear relationship has been captured by the model and that no systematic pattern is left in the residuals.
Scale-Location plot
On the Scale-Location plot, the red curve is close to a horizontal line, and the square roots of the standardized residuals are evenly distributed around it. The assumption of homoscedasticity of the residuals may therefore be satisfied. This observation should be verified with an appropriate statistical test.
RESET test
data: mdl
RESET = 0.31075, df1 = 2, df2 = 659, p-value = 0.733
The obtained p-value in the RESET test, which is 0.733, suggests no evidence of nonlinearity.
Linear independence
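Variance inflation factors (VIF):
| AH | MP | FTA | P2A | P3A |
|---|---|---|---|---|
| 1.014452 | 1.540117 | 1.157166 | 1.451581 | 1.031757 |
All values lie in the 1-5 range, which indicates at most moderate collinearity and no significant issues.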
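A variance inflation factor is easier to interpret once it is converted back to the share of a predictor's variance explained by the other predictors. By definition VIF = 1 / (1 - R²), so R² = 1 - 1 / VIF. The Python sketch below applies this to the reported values; the threshold of 5 is the rule of thumb used in this report.

```python
# Variance inflation factors reported for the final model.
vif = {
    "AH": 1.014452,
    "MP": 1.540117,
    "FTA": 1.157166,
    "P2A": 1.451581,
    "P3A": 1.031757,
}

# R-squared of each predictor regressed on the others, implied by its VIF.
implied_r_squared = {name: 1 - 1 / value for name, value in vif.items()}

for name, r2 in implied_r_squared.items():
    print(f"{name}: VIF = {vif[name]:.3f}, implied R-squared = {r2:.3f}")

print("all below 5:", all(value < 5 for value in vif.values()))
```

Even for MP, the predictor with the highest VIF, only about 35% of its variance is explained by the remaining predictors.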
Homoscedasticity
Goldfeld-Quandt test
data: mdl
GQ = 0.76987, df1 = 328, df2 = 327, p-value = 0.9909
alternative hypothesis: variance increases from segment 1 to 2
In the Goldfeld-Quandt test for the considered model, the p-value is 0.9909, so the null hypothesis of constant variance is not rejected and the homoscedasticity assumption of linear regression appears to be met.
Autocorrelation of errors
Durbin-Watson test
data: mdl
DW = 2.017, p-value = 0.5751
alternative hypothesis: true autocorrelation is greater than 0
Breusch-Godfrey test for serial correlation of order up to 2
data: mdl
LM test = 5.4617, df = 2, p-value = 0.06516
The Durbin-Watson test and the Breusch-Godfrey test for serial correlation of order up to 2 have p-values of 0.5751 and 0.0652, respectively. Both p-values are greater than 0.05, so there is not enough evidence to conclude that the model residuals are autocorrelated, although the Breusch-Godfrey result is only marginally above the threshold.
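The Durbin-Watson statistic itself is simple to compute and interpret. The Python sketch below shows the definition and the common approximation DW ≈ 2(1 - r), where r is the lag-1 autocorrelation of the residuals. The residual list is hypothetical, and the p-values reported above come from R and are not reproduced here.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (current - previous) ** 2
        for previous, current in zip(residuals, residuals[1:])
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def implied_lag1_autocorrelation(dw):
    """Approximate lag-1 autocorrelation implied by DW ~ 2 * (1 - r)."""
    return 1 - dw / 2


# Hypothetical residuals that alternate in sign (negative autocorrelation).
example_residuals = [1.0, -1.0, 1.0, -1.0]
print("DW for the example:", durbin_watson(example_residuals))

# Statistic reported for the model in this report.
print("implied r for DW = 2.017:", round(implied_lag1_autocorrelation(2.017), 4))
```

A statistic close to 2, such as the reported 2.017, corresponds to a lag-1 autocorrelation close to zero.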
Normality of errors
Shapiro-Wilk normality test
data: mdl$residuals
W = 0.9981, p-value = 0.6766
The Shapiro-Wilk test conducted does not reject the hypothesis of normality of the model residuals (p-value > 0.05).
Model predictions
On March 31, 1989, Michael Jordan played a home game against the Cleveland Cavaliers. He scored 37 points in 43 minutes of play, attempting 7 free throws, 27 two-point shots, and 2 three-pointers. Despite MJ’s excellent statistics, the Chicago Bulls lost the game.
| AH | MP | FTA | P2A | P3A |
|---|---|---|---|---|
| H | 43 | 7 | 27 | 2 |
| fit | lwr | upr |
|---|---|---|
| 36.56912 | 26.92057 | 46.21767 |
In the second test game for the predictive model, played on November 12, 1989, Michael Jordan faced the New Jersey Nets (now the Brooklyn Nets) in an away game that the Chicago Bulls lost.
Jordan played 43 minutes and scored 42 points, attempting 12 free throws, 28 two-point shots, and 3 three-pointers.
| AH | MP | FTA | P2A | P3A |
|---|---|---|---|---|
| A | 43 | 12 | 28 | 3 |
| fit | lwr | upr |
|---|---|---|
| 42.04767 | 32.39495 | 51.70038 |
On April 12, 1991, Michael Jordan played an away game against the Detroit Pistons. He scored 40 points in 43 minutes of play, attempting 15 free throws, 22 two-point shots, and 2 three-pointers. Despite MJ’s excellent statistics, the Chicago Bulls lost the game.
| AH | MP | FTA | P2A | P3A |
|---|---|---|---|---|
| A | 43 | 15 | 22 | 2 |
| fit | lwr | upr |
|---|---|---|
| 37.23804 | 27.58526 | 46.89082 |
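The point predictions for the three test games can be reproduced by hand from the coefficient table. The Python sketch below (the report's own predictions came from R) plugs each game's inputs into the fitted equation and compares the result with the points Jordan actually scored. The function name `predict_points` is hypothetical, and because the coefficients are rounded to five decimals the results differ from the `fit` column in the fourth decimal place.

```python
# Coefficient estimates copied from the model summary.
COEFFICIENTS = {
    "intercept": 3.20300,
    "AHH": 1.33433,
    "MP": -0.09970,
    "FTA": 0.88403,
    "P2A": 1.01380,
    "P3A": 1.37890,
}


def predict_points(location, minutes, fta, p2a, p3a):
    """Point prediction from the fitted linear model."""
    home = 1 if location == "H" else 0  # AHH dummy: 1 = home, 0 = away
    return (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["AHH"] * home
        + COEFFICIENTS["MP"] * minutes
        + COEFFICIENTS["FTA"] * fta
        + COEFFICIENTS["P2A"] * p2a
        + COEFFICIENTS["P3A"] * p3a
    )


# (date, location, MP, FTA, P2A, P3A, actual points) for the three test games.
test_games = [
    ("1989-03-31", "H", 43, 7, 27, 2, 37),
    ("1989-11-12", "A", 43, 12, 28, 3, 42),
    ("1991-04-12", "A", 43, 15, 22, 2, 40),
]

for date, location, minutes, fta, p2a, p3a, actual in test_games:
    predicted = predict_points(location, minutes, fta, p2a, p3a)
    print(f"{date}: predicted {predicted:.2f}, actual {actual}, "
          f"error {actual - predicted:+.2f}")
```

In all three games the actual point total lies inside the reported interval between `lwr` and `upr`, and the largest error, in the Detroit game, is under 3 points.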